Visual inter-word relations and their use in OCR postprocessing

Authors

  • Tao Hong
  • Jonathan J. Hull
Abstract

A technique is presented that uses visual relationships between word images in a document to improve the recognition of the text it contains. These relationships are usually lost in conventional optical character recognition (OCR) techniques. A visual relation is defined as the equivalence that exists between images of the same word or between portions of word images. An algorithm is presented that calculates these relationships in a document. The resulting clusters are integrated with the recognition results provided by an OCR system. Inconsistencies in OCR results between equivalent images are identified and used to improve recognition performance. Experimental results are presented in which the input is provided directly from a commercial OCR system.
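The integration step described in the abstract can be illustrated with a minimal sketch. It assumes the clusters of equivalent word images have already been computed (the image matching itself is not shown) and that disagreements within a cluster are resolved by a simple majority vote. The voting rule, the function names, and the `min_votes` parameter are illustrative assumptions, not the authors' published method.

```python
from collections import Counter, defaultdict


def find_inconsistent_clusters(cluster_ids, ocr_words):
    """Group OCR outputs by image cluster; return clusters whose members disagree.

    cluster_ids[i] is the (assumed, precomputed) cluster of word image i,
    and ocr_words[i] is the OCR system's output for that image.
    """
    groups = defaultdict(list)
    for idx, (cid, word) in enumerate(zip(cluster_ids, ocr_words)):
        groups[cid].append((idx, word))
    return {
        cid: members
        for cid, members in groups.items()
        if len({word for _, word in members}) > 1
    }


def correct_by_majority(cluster_ids, ocr_words, min_votes=2):
    """Replace minority readings in a cluster with the strict-majority reading.

    Assumption: a simple vote stands in for whatever decision rule the
    paper actually uses. Ties and weak majorities are left unchanged.
    """
    corrected = list(ocr_words)
    inconsistent = find_inconsistent_clusters(cluster_ids, ocr_words)
    for members in inconsistent.values():
        counts = Counter(word for _, word in members).most_common()
        best, n_best = counts[0]
        runner_up = counts[1][1]  # exists: the cluster has >= 2 distinct readings
        if n_best >= min_votes and n_best > runner_up:
            for idx, _ in members:
                corrected[idx] = best
    return corrected


if __name__ == "__main__":
    ids = [0, 1, 0, 2, 0, 1]
    words = ["recognition", "the", "recogmtion", "image", "recognition", "the"]
    print(correct_by_majority(ids, words))
```

In this toy input, the three images in cluster 0 are visually equivalent but the OCR output for one of them differs, so that reading is flagged and overwritten by the majority reading. A tie is left untouched, since the vote gives no basis for preferring either reading.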


Similar resources

Algorithms for postprocessing OCR results with visual inter-word constraints

Algorithms are presented that determine the visual relationships between word images in a document. These include instances of common word images and common substrings that occur often in English language text images. This information is then used to improve the performance of a commercial optical character recognition (OCR) algorithm. The algorithms presented here calculate clusters of equi...


A Unified Approach towards Text Recognition

In our recent research, we found that visual inter-word relations can be useful for different stages of English text recognition such as character segmentation and postprocessing. Different methods had been designed for different stages. In this paper, we propose a unified approach to use visual contextual information for text recognition. Each word image has a lattice, which is a data structure to...


The Postprocessing of Optical Character Recognition Based on Statistical Noisy Channel and Language Model

Image processing techniques have long been used in optical character recognition (OCR). The recognition method has evolved from early "pattern recognition" to, more recently, "feature extraction", and the recognition rate has risen from 70% to 90%. However, character-by-character recognition has its limitations. Using language models to assist the OCR system in improving recognition ...


A Statistical Approach to Automatic OCR Error Correction in Context

This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexico...


Multifont OCR Postprocessing System

A series of techniques is being developed to postprocess noisy, multifont, nonformatted OCR data on a word basis to 1) determine if a field is alphabetic or numeric; 2) verify that an alphabetic word is legitimate; 3) fetch from a dictionary a set of potential entries using a garbled word as a key; and 4) error-correct the garbled word by selecting the most likely dictionary word. Four algori...



Publication date: 1995